Geographically Weighted Modeling of Financial Inclusion in Tanzania

Author

Federico Jose Rodriguez

Published

October 21, 2024

Modified

November 9, 2024

(WIP) We look into financial inclusion in Tanzania based on a 2023 survey of close to 10 thousand residents.

A. Getting Started

A.1 Background

The World Bank defines financial inclusion as the state of having access to useful and affordable financial products to meet one’s needs. Financial inclusion is an enabler to 7 of the Sustainable Development Goals, and is seen as the key enabler to reduce extreme poverty.

One key dimension of financial inclusion that the World Bank looked at in their latest Global Findex Database 2021 is the ownership of bank accounts for adults. In this report, 76% of the global adult population have their own accounts, but only 71% of the developing nations’ do. In some countries like Tanzania this number is even lower at 52%. Banking is just one traditional dimension. Other vehicles like mobile payments can bridge the gap in access to services for some of these nations.

Tanzania recognizes the importance of financial inclusion in promoting economic growth and with the Bank of Tanzania, the country’s central bank, the Microfinance Policy of 2000 was developed and focused on expanding financial services for low-income individuals.

The program behind financial inclusion has been structured from 2014 with the first National Financial Inclusion Framework for 2014-2016, with the latest version being the third framework for 2023-2028. While there has been significant progress, (e.g., access to financial services has risen from 42% in 2013 to 89% in 2023) the country continues to aim for inclusion for the whole population by increasing access, encouraging usage and enhancing the quality of financial services.

A.2 Objectives

Finscope Tanzania 2023 is a public-private sector collaboration and aimed, among others, to understand and describe the financial behavior of individuals in the country and to establish an updated view of the level of financial inclusion across various measures. A large part of the findings is showing the change (improvements) of the overall measures against the previous 2017 report.

The objective of this study is to build on the Finscope Tanzania 2023 by identifying influential variables and identifying if geospatial factors influence the effect of those variables.

In order to satisfy this, the specific deliverables for the study will be:

  • to build a global or non-spatial explanatory model for the level of financial inclusion across Tanzania;

  • to build a geographically weighted explanatory model for the same response variables; and,

  • to assess the advantage of the geographically weighted model and to analyse the geographically weighted model

A.3 Data Sources

The following data sources are used for this analysis:

  • Finscope Tanzania 2023 individual survey data from Finscope Tanzania

    • The dataset is contained in a csv and translates the responses from 9,915 individuals who answered the survey

    • The respondents are all adults aged 16 years and above take across Tanzania

    • The dataset also includes derived fields which include different indicators for financial inclusion based on different criteria

  • District-level boundaries in Tanzania as a shapefile from geoBoundaries.org portal

A.4 Importing and Launching R Packages

For this study, the following R packages will be used. A description of the packages and the code, using p_load() of the pacman package, to import them is given below.

The loaded packages include:

  • olsrr - for building OLS (ordinary least squares) regression models and performing diagnostic tests

  • GWmodel - for calibrating geographically weighted family of models

  • tmap - for plotting cartographic quality maps

  • ggstatsplot - for multivariate data visualization and analysis

  • sf - spatial data handling

  • tidyverse - attribute data handling

pacman::p_load(olsrr, sf, GWmodel, tmap, tidyverse, ggstatsplot, sfdep)

B. Data Loading and Preparation

B.1 Loading Tanzania District boundaries

We load the district level boundaries in the following code chunk using st_read() and indicating the appropriate layer name. (i.e., the level 2 map) We also use rename() to already change the shapeName field to district to make it more understandable. We also project the map onto EPSG 32737 using st_transform() in order to be able to reference distances in terms of metres.

tz_dist <- st_read(dsn="data/geospatial", 
                   layer="geoBoundaries-TZA-ADM2") %>%
  rename(district = shapeName) %>%
  st_transform(32737)
Reading layer `geoBoundaries-TZA-ADM2' from data source 
  `C:\drkrodriguez\ISSS626-GAA\Take-home\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 170 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
Geodetic CRS:  WGS 84
tz_dist
Simple feature collection with 170 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -548018.9 ymin: 8698528 xmax: 658181.6 ymax: 9890194
Projected CRS: WGS 84 / UTM zone 37S
First 10 features:
       district shapeISO                 shapeID shapeGroup shapeType
1        Arusha     <NA> 72390352B32479700182608        TZA      ADM2
2  Arusha Urban     <NA> 72390352B90906351205470        TZA      ADM2
3        Karatu     <NA> 72390352B22674606658861        TZA      ADM2
4       Longido     <NA> 72390352B95731720096997        TZA      ADM2
5          Meru     <NA> 72390352B99598192663387        TZA      ADM2
6       Monduli     <NA> 72390352B11439822404473        TZA      ADM2
7    Ngorongoro     <NA> 72390352B42279830137418        TZA      ADM2
8         Ilala     <NA> 72390352B40584164885098        TZA      ADM2
9     Kinondoni     <NA> 72390352B66429416458525        TZA      ADM2
10       Temeke     <NA> 72390352B94835751472469        TZA      ADM2
                         geometry
1  MULTIPOLYGON (((262372 9603...
2  MULTIPOLYGON (((251788.2 96...
3  MULTIPOLYGON (((148006.1 96...
4  MULTIPOLYGON (((206258.1 96...
5  MULTIPOLYGON (((262372 9603...
6  MULTIPOLYGON (((226729.3 96...
7  MULTIPOLYGON (((160641.8 96...
8  MULTIPOLYGON (((530993 9249...
9  MULTIPOLYGON (((529848.2 92...
10 MULTIPOLYGON (((531400.6 92...

The output shows that there are 170 objects loaded which corresponds to individual districts. The object is also of multipolygon class which could indicate that there are districts with discontinuous land areas, like islands.

We can create a simple map to visualize the boundaries using qtm() from tmap.

tmap_mode("view")
tmap mode set to interactive viewing
qtm(tz_dist, text = "district", text.size = 0.4)
tmap_mode("plot")
tmap mode set to plotting

B.2 Deriving District centroids

Before we load the aspatial data, we will process the district boundary map to be able to use it for future operations with the other dataset. One step that needs to be done is to define representative points, which can be the centroids for the boundary map. The primary purpose of this is to be able to map the aspatial data for a district into a single location. In order to do this, the first step is to convert the multipolygon layer object into a polygon object which will allow for proper centroid calculations for each district.

We use the code chunk below to convert the sf object into polygons using st_cast() and then create a column for each individual polygon’s area using mutate() with st_area(). We then use groupby() to reduce back the object to one row per district and then filter() to keep only the largest polygon for each district.

tz_dist_poly <- tz_dist %>%
  st_cast("POLYGON") %>%
  mutate(area = st_area(.)) %>%
  group_by(district) %>%
  filter(area == max(area)) %>%
  ungroup() %>%
  select(-area) %>%
  select(district)
Warning in st_cast.sf(., "POLYGON"): repeating attributes for all
sub-geometries for which they may not be constant

We can produce a map with tmap package to see if the operation had any irregular effects on the geography. In order to see the difference between the original map and the polygon map, we add the original map as the first layer in red, and then overlay the polygon map. The areas which now appear red are the polygons or areas that have been excluded from the original map to the polygon map.

tm_shape(tz_dist) +
  tm_polygons("red") +
tm_shape(tz_dist_poly) +
  tm_polygons("grey") +
  tm_layout(title = "Full vs Poly Map",
            title.position = c("left", "bottom"))

From the output, we see that in addition to a few small islands, there are three inland areas that have been excluded from the original map. While this produces holes in the new map, this might not be a big concern right now as long as the centroids we get from the remaining geometries is meaningful.

In order to generate the centroids, we can now use st_centroid() to compute them across each district’s largest polygon.

tz_dist_centroids <- st_centroid(tz_dist_poly)
Warning: st_centroid assumes attributes are constant over geometries

We can check the location of the centroids by plotting them as a layer on top of the original district boundary layer using tmap package in the code below.

tm_shape(tz_dist) +
  tm_polygons("grey") +
tm_shape(tz_dist_centroids) +
  tm_dots("green", size = 0.2) +
  tm_layout(title = "District Centroids",
            title.position = c("left", "bottom"))

The centroid locations look mostly acceptable, with a few exceptions where they might be lying somewhere away from the district boundaries given some of the districts have odd, nonconvex shapes.

B.2 Loading Finscope Tanzania 2023 Respondent Data

We can use read_csv() in the code chunk below to load the raw respondent data into an R object.

fstz23 <- read_csv("data/aspatial/FinScope Tanzania 2023_Individual Main Data_FINAL.csv", show_col_types = FALSE)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

There are 9915 rows or records, and 721 columns or fields. Most of these columns should not be relevant in meeting our objective, so it is advised to limit the data we work with to those meaningful variable. These variables should be our variable(s) of concern, or the dependent variable(s), and the variables that may contribute to it, or the independent variables.

B.4 Preparing Finscope Tanzania 2023 Respondent Data

B.4.1 Selecting potential variables

The first step in preparing the dataset is to reduce the data by keeping only the potentially relevant fields. This means identifying fields that can be used as is or to derive both the response variable(s) and the explanatory variables.

This is performed by scanning the datamap file (i.e., data dictionary) that accompanies the dataset. From there, we decide to keep only the following variables. We also change the variable names to a shorter and more recognizable one.

Variable Name Description New Variable Name
BANKED Classified as: Banked; Not Banked fi_banked
OVERALL_FORMAL Classified as using formal instruments: Yes, No fi_formal
INFORMAL Classified as using informal instruments: Yes, No fi_informal
Variable Description New Variable Name
reg_name Region name region
dist_name District name district
ward_name Ward name ward
clustertype Indicates if in rural or urban urban
Variable Name Description New Variable Name
c8c Age age
c9 Gender: male. or female female
c10 Marital status: married, divorced, widowed, single (4 levels) maritalstatus
c11 Highest level education (10 levels of values) education
c2 Head of household: respondent, not the respondent head_hh
c8n_a1 Visually impaired: Yes, No visual_impaired
c8n_b1 Hearing impaired: Yes, No hearing_impaired
c8n_c1 Communication impaired: Yes, No comm_impaired
c8n_d1 Movement impaired: Yes, No move_impaired
c8n_e1 Difficulty with daily activities: Yes, No daily_impaired
c8n_f1 Difficulty remembering and concentrating: Yes, No cogn_impaired
Variable Name Description New Variable Name
c12_1 Land ownership (6 levels) land_own
c14 Family involved in agriculture, fishing or aquaculture: Yes, No agricultural
c18_2 Primary source of funds (12 levels) source_of_funds
c27__17 Has some form of ID: Yes, No has_id
D2_1__1 Receives salary from regular job: Yes, No reg_job
D2_1__2 Receives money from selling goods produced: Yes, No production
D2_1__11 Does not receive income: Yes, No no_income
IncomeMain Derived variable for main source of income (14 levels) income_source
Variable Name Description New Variable Name
c23__1 Access to mobile phone: No, Yes mobile
c23__2 Access to internet: No, Yes internet
c24_1 Personally own mobile phone: No, Yes own_mobile

The code block below compiles the 29 variables and their new names in two separate objects. These will be used in the succeeding steps for data preparation.

colstokeep <- c("reg_name", "dist_name", "ward_name", "clustertype",
                "c8c", "c9", "c10", "c11", "c2",
                "c8n_a1", "c8n_b1", "c8n_c1", "c8n_d1", "c8n_e1", "c8n_f1",
                "c12_1", "c14", "c18_2",
                "c23__1", "c23__2", "c24_1", "c27__17",
                "D2_1__1", "D2_1__2", "D2_1__11",
                "BANKED", "OVERALL_FORMAL", "INFORMAL", "IncomeMain")
newnames <- c("region", "district", "ward", "urban",
              "age", "female", "maritalstatus", "education", "head_hh",
              "visual_impaired", "hearing_impaired", "comm_impaired",
              "move_impaired", "daily_impaired", "cogn_impaired",
              "land_own", "agricultural", "source_of_funds",
              "mobile", "internet", "own_mobile", "has_id",
              "reg_job", "production", "no_income",
              "fi_banked", "fi_formal", "fi_informal", "income_source")
length(colstokeep)
[1] 29
length(newnames)
[1] 29

We then use the code chunk below to keep the selected variables using select() and then to rename the variable or column names using colnames()

fstz23_sf <- fstz23 %>%
  select(all_of(colstokeep))

colnames(fstz23_sf) <- newnames

B.4.2 Recoding of variables

Recoding of variables is a data preparation step where variable values are replaced by another. This may be done for reasons like cleaning the data or standardising the data. This is generally performed in R using the recode() function.

B.4.2.1 Recoding of district names

We have district names in the map and in the survey data. As these are two different data sources, there is a chance that they mismatch. We need to ensure that they use the same names as we will use these to add the geographic information to the survey data.

We first check which names in each set do not have a corresponding match in the other. We can do this by performing a left join using left_join() and checking which elements do not have matches. We can the use filter() to find the records that did not return a valid value from the other dataset. The code chunk performs this left join approach twice as it needs to be checked for direction for each data source, and then we display the district names that are unmatched for each dataset.

mismatched_values <- tz_dist %>%
  left_join(fstz23_sf, by = "district") %>%
  filter(is.na(region)) %>%
  select(district)
mismatched_tzmap <- mismatched_values$district

mismatched_values <- fstz23_sf %>%
  left_join(tz_dist, by = "district") %>%
  filter(is.na(shapeType)) %>%
  select(district)
mismatched_fstz23 <- mismatched_values$district

list(
  mismatched_in_map = mismatched_tzmap,
  number_mm1 = length(c(mismatched_tzmap)),
  mismatched_in_survey = unique(mismatched_fstz23),
  number_mm2 = length(c(unique(mismatched_fstz23)))
)
$mismatched_in_map
 [1] "Arusha Urban"                 "Meru"                        
 [3] "Dodoma Urban"                 "Iringa Urban"                
 [5] "Mafinga Township Authority"   "Bukoba Urban"                
 [7] "Mpanda Urban"                 "Kasulu Township Authority"   
 [9] "Kigoma  Urban"                "Moshi Urban"                 
[11] "Lindi Urban"                  "Babati UrbanBabati Urban"    
[13] "Butiam"                       "Musoma Urban"                
[15] "Mbeya Urban"                  "Magharibi"                   
[17] "Morogoro Urban"               "Masasi  Township Authority"  
[19] "Mtwara Urban"                 "Makambako Township Authority"
[21] "Njombe Urban"                 "Kibaha Urban"                
[23] "Mafia"                        "Sumbawanga Urban"            
[25] "Songea Urban"                 "Kahama Township Authority"   
[27] "Shinyanga Urban"              "Singida Urban"               
[29] "Tunduma"                      "Tabora Urban"                
[31] "Handeni Mji"                  "Korogwe"                     
[33] "Korogwe Township Authority"   "Tanga Urban"                 

$number_mm1
[1] 34

$mismatched_in_survey
 [1] "Tanganyika"  "Kigamboni"   "Arumeru"     "Butiama"     "Dodoma"     
 [6] "Tanga"       "Malinyi"     "Kibiti"      "Magharibi B" "Magharibi A"
[11] "Ubungo"      "Tabora"     

$number_mm2
[1] 12

The output reveals that there are 34 district names in the boundary map that are not matched, while there are 12 in the survey data that are not matched. We do not need to ensure all 34 in the first dataset is matched as there might really be areas where there are no respondents or residents. We do, however, want almost all, if not all, of the records in the second dataset to be matched as this is where our modeling data sits. It is also worth noting that all the recoding for districts will be done on fstz23_sf.

We will perform checks and data cleaning for the 12 unmatched values in fstz23_sf, and we will also explore some of the remaining unmatched variables in tz_dist. We see that there are values which have “Urban” at the end so there might also be an opportunity to create matches for them.

For unmatched values, there might be differences in spellings between the two sources. While this doesn’t guarantee catching misspellings, we can at least check if there are other variables that share the first few letters with the unmatched value.

To find district values which contains “Tang” we can use str_detect() on both dataset’s columns.

to_find <- "Tanga"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
[1] "Tanga Urban"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Tanganyika" "Tanga"     

There is only one value in tz_dist that contains “Tang” but two in fstz_23. It should be safe to assume that Tanga and Tanga Urban are one and the same. We can perform the recoding using recode() within mutate() for the district column of fstz_23.

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Tanga" = "Tanga Urban"))

For Tanganyika, we can use the following code chunk to check how many survey records are affected using length(), and then show the wards and regions for the records that do have a district name of Tanganyika.

# Count number of records which has district name Tanganyika
count(fstz23_sf[str_detect(fstz23_sf$district,"Tanganyika"),])
# A tibble: 1 × 1
      n
  <int>
1    89
# Show regions and wards for records with district name Tanganyika
unique(select(fstz23_sf[str_detect(fstz23_sf$district,"Tanganyika"),], 
       c(region, district, ward)))
# A tibble: 6 × 3
  region district   ward    
  <chr>  <chr>      <chr>   
1 Katavi Tanganyika Mnyagala
2 Katavi Tanganyika Ikola   
3 Katavi Tanganyika Bulamata
4 Katavi Tanganyika Mishamo 
5 Katavi Tanganyika Sibwesa 
6 Katavi Tanganyika Isengule

There are six wards that all reflect the same region name of “Katavi”. Upon research, it appears that the Tanganyika district was recently formed and was part of the rural area of the district Mpanda. As such, we will recode Tanganyika to Mpanda using the same approach with

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Tanganyika" = "Mpanda"))

We saw that there are two districts with the word Magharibi in the fstz23, and one in tz_dist. We can confirm this by using str_detect() to check for all district names containing “Mag” (so we also check some variation in spelling) in both datasets.

to_find <- "Magha"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
[1] "Magharibi"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Magharibi B" "Magharibi A"

For this case, we drop the B and A by using recode() on the district column of ftsz23_sf.

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Magharibi B" = "Magharibi",
                            "Magharibi A" = "Magharibi"))

We next look into the unmatched district “Arumeru” in fstz23. We will use the code chunk below for this district and most of the succeeding ones to check district names that match in both data sets, show the regions and wards in fstz23 for the district, and the number of records which reflect that district. For these, we continue using str_detect() which is included in tidyverse under the stringr package in order to find matches of a substring.

to_find <- "Arumeru"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
character(0)
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Arumeru"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1   105
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 7 × 4
  region district ward         urban
  <chr>  <chr>    <chr>        <chr>
1 Arusha Arumeru  Poli         Urban
2 Arusha Arumeru  Maroroni     Rural
3 Arusha Arumeru  Olmotonyi    Rural
4 Arusha Arumeru  Oloirien     Urban
5 Arusha Arumeru  Maji ya Chai Rural
6 Arusha Arumeru  Kisongo      Rural
7 Arusha Arumeru  Nkoaranga    Rural

The output shows that there are no districts in tz_dist that have a name of Arumeru, but there are 105 in fstz23. Upon research, we see that the wards of Arumeru was split between Arusha and Meru. Among the wards in the dataset, the following are now part of Meru: Maroroni, Poli, Maji ya Chai, Nkoaranga. The balance 3 are Arusha: Olmotonyi, Oloirien, Kisongo.

We first update the records for the last three wards to reflect a district name of Arusha using the following code chunk.

fstz23_sf[(fstz23_sf$ward == "Olmotonyi" | fstz23_sf$ward == "Oloirien" | fstz23_sf$ward == "Kisongo"),]$district = "Arusha"

We can then use recode() to change the remaining records that are still reflecting “Arumeru” and change them to “Meru”

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Arumeru" = "Meru"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Butiam”

to_find <- "Butiam"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
[1] "Butiam"
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Butiama"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1    45
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 3 × 4
  region district ward    urban
  <chr>  <chr>    <chr>   <chr>
1 Mara   Butiama  Butiama Urban
2 Mara   Butiama  Bukabwa Rural
3 Mara   Butiama  Mirwa   Rural

We see that there is only one district from each dataset and it appears that they can only refer to the same district. We then use recode() to update “Butiama” to “Butiam”

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Butiama" = "Butiam"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Dodom”

to_find <- "Dodom"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
[1] "Dodoma Urban"
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Dodoma"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1   142
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 10 × 4
   region district ward            urban
   <chr>  <chr>    <chr>           <chr>
 1 Dodoma Dodoma   Msalato         Urban
 2 Dodoma Dodoma   Mnadani         Urban
 3 Dodoma Dodoma   Ntyuka          Urban
 4 Dodoma Dodoma   Mbabala         Urban
 5 Dodoma Dodoma   Nkuhungu        Urban
 6 Dodoma Dodoma   Nzuguni         Urban
 7 Dodoma Dodoma   Hombolo Bwawani Urban
 8 Dodoma Dodoma   Kikuyu Kusini   Urban
 9 Dodoma Dodoma   Mkonze          Urban
10 Dodoma Dodoma   Uhuru           Urban

We again see that there is a one-to-one matching for the sole district on both dataset. We will then update “Dodoma” to “Dodoma Urban” using recode() in the chunk below

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Dodoma" = "Dodoma Urban"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Kibit”

to_find <- "Kibit"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
character(0)
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Kibiti"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1    29
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 2 × 4
  region district ward   urban
  <chr>  <chr>    <chr>  <chr>
1 Pwani  Kibiti   Mbuchi Rural
2 Pwani  Kibiti   Bungu  Rural

Upon research, it appears that Kibiti is a relatively new district. The two wards that appear under it were actually part of Rufiji district, so we can change “Kibiti” to “Rufiji” using encode()

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Kibiti" = "Rufiji"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Kigamb”

to_find <- "Kigamb"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
character(0)
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Kigamboni"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1    45
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 3 × 4
  region        district  ward      urban
  <chr>         <chr>     <chr>     <chr>
1 Dar es Salaam Kigamboni Kibada    Urban
2 Dar es Salaam Kigamboni Kigamboni Urban
3 Dar es Salaam Kigamboni Somangila Urban

There is no indication of the district merging or splitting recently from another, but if we check our map, the region should occupy part of the space which appears as “Temeke”. We then recode Kigamboni as Temeke using the following code chunk.

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Kigamboni" = "Temeke"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Malin”

to_find <- "Malin"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
character(0)
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Malinyi"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1    15
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 1 × 4
  region   district ward     urban
  <chr>    <chr>    <chr>    <chr>
1 Morogoro Malinyi  Usangule Rural

If we check the map, the location of Malinyi falls in the region of Ulanga in our map. We again use recode() to change the district names to Ulanga.

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Malinyi" = "Ulanga"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Tabor”

to_find <- "Tabor"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
[1] "Tabora Urban"
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Tabora"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1    45
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 3 × 4
  region district ward     urban
  <chr>  <chr>    <chr>    <chr>
1 Tabora Tabora   Mbugani  Urban
2 Tabora Tabora   Kiloleni Urban
3 Tabora Tabora   Mwinyi   Urban

Given that there is only “Tabora Urban” in tz_dist, we should be able to update the district names in fstz23 using this.

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Tabora" = "Tabora Urban"))

We execute the same code chunk that makes use of recode() to find records where the district name contains “Ubung”

to_find <- "Ubung"
print("Matches in tz_dist")
[1] "Matches in tz_dist"
unique(tz_dist_poly[str_detect(tz_dist_poly$district,to_find),]$district)
character(0)
print("Matches in ftsz23")
[1] "Matches in ftsz23"
unique(fstz23_sf[str_detect(fstz23_sf$district,to_find),]$district)
[1] "Ubungo"
count(fstz23_sf[str_detect(fstz23_sf$district,to_find),])
# A tibble: 1 × 1
      n
  <int>
1    88
unique(select(fstz23_sf[str_detect(fstz23_sf$district,to_find),], 
       c(region, district, ward, urban)))
# A tibble: 6 × 4
  region        district ward    urban
  <chr>         <chr>    <chr>   <chr>
1 Dar es Salaam Ubungo   Mabibo  Urban
2 Dar es Salaam Ubungo   Mbezi   Urban
3 Dar es Salaam Ubungo   Kimara  Urban
4 Dar es Salaam Ubungo   Msigani Urban
5 Dar es Salaam Ubungo   Goba    Urban
6 Dar es Salaam Ubungo   Manzese Urban

Upon checking, the location of Ubungo appears to be within the boundaries of the district Kinondoni in our map. We recode it as such with the following code chunk.

fstz23_sf <- mutate(fstz23_sf, district = recode(district,
                            "Ubungo" = "Kinondoni"))

We have removed all mismatches from fstz23 but have 28 unmatched district names in tz_dist. Among these are a few districts that are suffixed by “Urban”

mismatched_values <- tz_dist %>%
  left_join(fstz23_sf, by = "district") %>%
  filter(is.na(region)) %>%
  select(district)
mismatch_urban <- mismatched_values[str_detect(mismatched_values$district,"Urban"),]$district
mismatch_urban
 [1] "Arusha Urban"             "Iringa Urban"            
 [3] "Bukoba Urban"             "Mpanda Urban"            
 [5] "Kigoma  Urban"            "Moshi Urban"             
 [7] "Lindi Urban"              "Babati UrbanBabati Urban"
 [9] "Musoma Urban"             "Mbeya Urban"             
[11] "Morogoro Urban"           "Mtwara Urban"            
[13] "Njombe Urban"             "Kibaha Urban"            
[15] "Sumbawanga Urban"         "Songea Urban"            
[17] "Shinyanga Urban"          "Singida Urban"           

For these, we will assume they refer to the urban region of a given district. For example, Arusha Urban will cover the urban area of the Arusha district. Based on this assumption, we can go through ftsz23 to check for records for that district and are in the urban area and then map them to the values above. We use a for loop in the code chunk below to find records in fstz23 that match these conditions and then apply the suffixed district names as replacements. We use the word() function from stringr package in order to pick the first word and treat it as the “plain” district name.

for (i in mismatch_urban)
  {
    dist = word(i, 1)
    fstz23_sf[str_detect(fstz23_sf$district,dist) & fstz23_sf$urban == "Urban",]$district = i
  }

We can check for the remaining mismatches by rerunning the earlier code.

mismatched_values <- tz_dist %>%
  left_join(fstz23_sf, by = "district") %>%
  filter(is.na(region)) %>%
  select(district)
mismatched_tzmap <- mismatched_values$district

mismatched_values <- fstz23_sf %>%
  left_join(tz_dist, by = "district") %>%
  filter(is.na(shapeType)) %>%
  select(district)
mismatched_fstz23 <- mismatched_values$district

list(
  mismatched_in_map = mismatched_tzmap,
  number_mm1 = length(c(mismatched_tzmap)),
  mismatched_in_survey = unique(mismatched_fstz23),
  number_mm2 = length(c(unique(mismatched_fstz23)))
)
$mismatched_in_map
 [1] "Iringa Urban"                 "Mafinga Township Authority"  
 [3] "Kasulu Township Authority"    "Masasi  Township Authority"  
 [5] "Makambako Township Authority" "Mafia"                       
 [7] "Kahama Township Authority"    "Tunduma"                     
 [9] "Handeni Mji"                  "Korogwe"                     
[11] "Korogwe Township Authority"  

$number_mm1
[1] 11

$mismatched_in_survey
character(0)

$number_mm2
[1] 0

We were able to remove all mismatches from fstz23 and then reduce the mismatches from tz_dist from 34 to 11. We can visualize the distribution of the (recoded) records in our map by first adding the number of records into the sf object. We use count() in the code chunk below to compute for the number of records per district. We then use left_join() to merge it with the district map, and replace any zero values (from mismatches) with zero.

tz_dist_stat <- tz_dist %>%
  left_join(count(fstz23_sf, district), by= "district") %>%
  rename("orig_records" = "n") %>%
  replace(is.na(.),0)

We can then use tmap package to plot the distribution of respondents, We first add a layer in grey for the full map and then add a choropleth map for districts with nonzero number of respondents. This will show districts with no respondents as grey.

tm_shape(tz_dist)+
  tm_polygons("grey") +
tm_shape(tz_dist_stat[tz_dist_stat$orig_records > 0,]) +
  tm_polygons("orig_records", title = "Original Records")

The output shows that there are only a few districts that do not have any respondents. Most of the districts have between 0 to 50 respondents. The highest number of respondents in a single district is between 150 and 200.

B.4.2.2 Recoding of Modelling Variables

If we look at the data, the modelling variables that we have in fstz23 are mostly categorical. Most of these are binary, but some have more than two values. While we can use categorical variables for EDA, these do not work well once we start modelling. For our case, we will go ahead and convert most of these variables upfront.

We can use the following code chunk which uses the unique() function to display the distinct values. We use lapply() to run the function on each variable in the dataframe but we exclude the region, district, ward and age as these will either not be part of the modelling, or it is already in numeric form.

lapply(select(fstz23_sf,-c(region, district, ward, age)), unique)
$urban
[1] "Rural" "Urban"

$female
[1] "Female" "Male"  

$maritalstatus
[1] "Married/living together" "Widowed"                
[3] "Divorced/separated"      "Single/never married"   

$education
 [1] "Some primary"                             
 [2] "No formal education"                      
 [3] "Primary completed"                        
 [4] "Some secondary"                           
 [5] "Some University or other higher education"
 [6] "University or higher education completed" 
 [7] "Secondary competed-O level"               
 [8] "Post primary technical training"          
 [9] "Secondary completed-A level"              
[10] "Don’t know"                               

$head_hh
[1] "Respondent is hhh"  "Respondent not hhh"

$visual_impaired
[1] "Yes" "No" 

$hearing_impaired
[1] "No"  "Yes"

$comm_impaired
[1] "No"  "Yes"

$move_impaired
[1] "No"  "Yes"

$daily_impaired
[1] "No"  "Yes"

$cogn_impaired
[1] "No"  "Yes"

$land_own
[1] "You personally own the land/plot where you live" 
[2] "The land/plot is rented"                         
[3] "You own the land/plot together with someone else"
[4] "You don’t own or rent the land"                  
[5] "Other A household members owns the land/plot"    
[6] "Don’t know (Don’t read out)"                     

$agricultural
[1] "Yes" "No" 

$source_of_funds
 [1] "Your household sells some of its crops and uses the money"                                              
 [2] NA                                                                                                       
 [3] "Your household has to borrow money"                                                                     
 [4] "Your household has money to buy it, it uses money from wages / other regular job /  sources of income"  
 [5] "Help from friends/relatives/neighbors/community/Government"                                             
 [6] "Use savings the household has"                                                                          
 [7] "Your household sells non-agricultural things to get money"                                              
 [8] "Your household gets it from a buyer to whom it has to sell its crop, livestock or fish when it is ready"
 [9] "Your household does piece work/casual jobs to get money to buy it"                                      
[10] "Your household sells some of its livestock and uses the money"                                          
[11] "Your household gets it in exchange for work it does"                                                    
[12] "Your household doesn’t have to buy because it manage with what it has"                                  
[13] "Your household sells products like milk, eggs that it get from its livestock to get money to buy it"    

$mobile
[1] "Yes" "No" 

$internet
[1] "Yes" "No"  NA   

$own_mobile
[1] "Yes" "2"  

$has_id
[1] "No"  "Yes"

$reg_job
[1] "No"  "Yes"

$production
[1] "Yes" "No" 

$no_income
[1] "No"  "Yes"

$fi_banked
[1] "Not Banked" "Banked"    

$fi_formal
[1] "OVERALL_FORMAL"     "Not OVERALL_FORMAL"

$fi_informal
[1] "INFORMAL incl SACCO AND CMG RISK CONTRIBUTIONS"
[2] "Not INFORMAL"                                  

$income_source
 [1] "Farmers and fishers"                                         
 [2] "Piece work/casual labor"                                     
 [3] "Traders - non-agricultural"                                  
 [4] "Dependents"                                                  
 [5] "Service providers"                                           
 [6] "Traders - agricultural products"                             
 [7] "Welfare"                                                     
 [8] "Formal sector salaried"                                      
 [9] "Pension"                                                     
[10] "Informal sector salaried"                                    
[11] "Other"                                                       
[12] "Rental income"                                               
[13] "Gambling"                                                    
[14] "Interest from savings, investments, stocks, unit trusts etc."

The output reveals that 5 of the variables have more than 2 values or levels, while the balance 20 are binary. We also see that for the internet variable there are some records that show NA. We use the code below to check if each of the records have NA for internet by using is.na(), and then counting the number of records by just adding up the TRUE values using sum()

sum(is.na(fstz23_sf$internet))
[1] 7

As there are only seven with NA values, and there is no sure way of replacing them with the right value, removing them from the dataset should not produce any big issues. We use the code chunk below to remove the na’s from the internet variable. There are a number of different ways to remove na’s, here we just use a mask based on the complement of the results of the is.na() function which returns TRUE for any invalid values. We include the count of rows before and after running the code to check that there is a difference of seven rows.

nrow(fstz23_sf)
[1] 9915
fstz23_sf <- fstz23_sf[!is.na(fstz23_sf$internet),]
nrow(fstz23_sf)
[1] 9908

For binary variables, we will use the code below to replace the ‘positive’ value with 1 and the negative with 0. The values for each variable may vary so we need to apply the recoding individually for these variables.

fstz23_sf <- mutate(fstz23_sf,
                    urban = recode(urban, "Rural" = 0, "Urban" = 1),
                    female = recode(female, "Male" = 0, "Female" = 1),
                    head_hh = recode(head_hh, "Respondent not hhh" = 0, "Respondent is hhh" = 1),
                    visual_impaired = recode(visual_impaired, "No" = 0, "Yes" = 1),
                    hearing_impaired = recode(hearing_impaired, "No" = 0, "Yes" = 1),
                    comm_impaired = recode(comm_impaired, "No" = 0, "Yes" = 1),
                    move_impaired = recode(move_impaired, "No" = 0, "Yes" = 1),
                    daily_impaired = recode(daily_impaired, "No" = 0, "Yes" = 1),
                    cogn_impaired = recode(cogn_impaired, "No" = 0, "Yes" = 1),
                    agricultural = recode(agricultural, "No" = 0, "Yes" = 1),
                    mobile = recode(mobile, "No" = 0, "Yes" = 1),
                    internet = recode(internet, "No" = 0, "Yes" = 1),
                    own_mobile = recode(own_mobile, "2" = 0, "Yes" = 1),
                    has_id = recode(has_id, "No" = 0, "Yes" = 1),
                    reg_job = recode(reg_job, "No" = 0, "Yes" = 1),
                    production = recode(production, "No" = 0, "Yes" = 1),
                    no_income = recode(no_income, "No" = 0, "Yes" = 1),
                    fi_banked = recode(fi_banked, "Not Banked" = 0, "Banked" = 1),
                    fi_formal = recode(fi_formal, "Not OVERALL_FORMAL" = 0, "OVERALL_FORMAL" = 1),
                    fi_informal = recode(fi_informal, "Not INFORMAL" = 0,
                                         "INFORMAL incl SACCO AND CMG RISK CONTRIBUTIONS" = 1))

For the non-binary variables, we start by checking the distribution of the data across their different levels. Where there is a very small amount of data in one category, we will not be too worried merging them with another (logical) one. We use the code chunk below which uses count() to give the number of records or rows for each of the values of the indicated column. We then wrap the output in arrange() to sort the values in ascending number of records.

arrange(count(fstz23_sf, maritalstatus),n)
# A tibble: 4 × 2
  maritalstatus               n
  <chr>                   <int>
1 Widowed                   949
2 Divorced/separated        956
3 Single/never married     1934
4 Married/living together  6069
arrange(count(fstz23_sf, education),n)
# A tibble: 10 × 2
   education                                     n
   <chr>                                     <int>
 1 Don’t know                                    4
 2 Secondary completed-A level                  40
 3 Post primary technical training              55
 4 Some University or other higher education   125
 5 University or higher education completed    292
 6 Some secondary                              830
 7 Secondary competed-O level                 1273
 8 Some primary                               1354
 9 No formal education                        1592
10 Primary completed                          4343
arrange(count(fstz23_sf, land_own),n)
# A tibble: 6 × 2
  land_own                                             n
  <chr>                                            <int>
1 Don’t know (Don’t read out)                         14
2 The land/plot is rented                           1022
3 You own the land/plot together with someone else  1544
4 You don’t own or rent the land                    1820
5 Other A household members owns the land/plot      1974
6 You personally own the land/plot where you live   3534
arrange(count(fstz23_sf, source_of_funds),n)
# A tibble: 13 × 2
   source_of_funds                                                             n
   <chr>                                                                   <int>
 1 Your household gets it in exchange for work it does                        30
 2 Your household sells products like milk, eggs that it get from its liv…    32
 3 Your household doesn’t have to buy because it manage with what it has      43
 4 Help from friends/relatives/neighbors/community/Government                 67
 5 Your household has to borrow money                                        101
 6 Your household gets it from a buyer to whom it has to sell its crop, l…   146
 7 Your household sells non-agricultural things to get money                 184
 8 Your household sells some of its livestock and uses the money             304
 9 Your household has money to buy it, it uses money from wages / other r…   448
10 Use savings the household has                                             930
11 Your household does piece work/casual jobs to get money to buy it        1164
12 Your household sells some of its crops and uses the money                2325
13 <NA>                                                                     4134
arrange(count(fstz23_sf, income_source),n)
# A tibble: 14 × 2
   income_source                                                    n
   <chr>                                                        <int>
 1 Interest from savings, investments, stocks, unit trusts etc.     2
 2 Gambling                                                         6
 3 Rental income                                                   46
 4 Other                                                           67
 5 Pension                                                         67
 6 Welfare                                                         87
 7 Informal sector salaried                                       209
 8 Traders - agricultural products                                218
 9 Service providers                                              386
10 Formal sector salaried                                         426
11 Traders - non-agricultural                                     643
12 Dependents                                                    1960
13 Piece work/casual labor                                       2559
14 Farmers and fishers                                           3232

The output shows that there is a very large number of NAs in the source of funds. We highlight this using the is.na() function in the first code chunk below. As there are more than 40% missing values, and the income_source variable may already be holding the similar, but more complete, information, we will go ahead and drop the variable by using the select() function in the second code chunk.

sum(is.na(fstz23_sf$source_of_funds))
[1] 4134
fstz23_sf <- select(fstz23_sf, -c(source_of_funds))

While we will be preforming a separate EDA for the variables, we will already create binary variables for certain levels of the remaining categorical variables. Common practice is to produce n-1 dummy variable for a variable with n-levels, however, we will opt to consolidate some of the variables together where it makes sense. For martial status, we can keep three levels. The first variable denotes whether the respondent is currently married, the second variable will be whether the respondent is widowed, separated or divorced. Single respondents should reflect a value of zero for both of the variables. For the code chunks, we use as.integer() to convert the logical outputs into zeros and ones.

fstz23_sf$is_married <- as.integer(fstz23_sf$maritalstatus == "Married/living together")
fstz23_sf$was_married <- as.integer(fstz23_sf$maritalstatus == "Widowed" | fstz23_sf$maritalstatus == "Divorced/separated")

For education, we will keep four levels for primary, secondary and tertiary or higher education. (with the fourth level denoting that the respondent has not completed primary)

fstz23_sf$educ_primary <- as.integer(fstz23_sf$education == "Primary completed" | fstz23_sf$education == "Some secondary" | fstz23_sf$education == "Post primary technical training")
fstz23_sf$educ_secondary <- as.integer(fstz23_sf$education == "Secondary competed-O level" | fstz23_sf$education == "Secondary completed-A level" | fstz23_sf$education == "Some University or other higher education")
fstz23_sf$educ_tertiary <- as.integer(fstz23_sf$education == "University or higher education completed")

For land ownership, we define start with four levels to denote whether the respondent personally owns the land, the land is owned by family or shared with someone, the land is rented, or the land is neither owned nor rented.

fstz23_sf$land_self_own <- as.integer(fstz23_sf$land_own == "You personally own the land/plot where you live")
fstz23_sf$land_hh_or_shared <- as.integer(fstz23_sf$land_own == "You own the land/plot together with someone else" | fstz23_sf$land_own == "Other A household members owns the land/plot")
fstz23_sf$land_rented <- as.integer(fstz23_sf$land_own == "The land/plot is rented")

For sources of income, we see that the largest group is “Farmers and Fishers” (3232), “Piece-work or Casual Labor” (2559) and “Dependents” (1960). We will keep these three as distinct levels. We can then define traders (861), salaried (635) and all other sources excuding welfare, gambling, pension, and service providers. We will take service providers to be very similar to casual labor as it also counts as an irregular source of funds.

fstz23_sf$income_farm_and_fish <- as.integer(fstz23_sf$income_source == "Farmers and fishers")
fstz23_sf$income_piecework <- as.integer(fstz23_sf$income_source == "Piece work/casual labor" | fstz23_sf$income_source == "Service providers")
fstz23_sf$income_dependent <- as.integer(fstz23_sf$income_source == "Dependents")
fstz23_sf$income_trader <- as.integer(fstz23_sf$income_source == "Traders - non-agricultural" | fstz23_sf$income_source == "Traders - agricultural products")
fstz23_sf$income_salaried <- as.integer(fstz23_sf$income_source == "Formal sector salaried" | fstz23_sf$income_source == "Informal sector salaried")
fstz23_sf$income_other <- as.integer(fstz23_sf$income_source == "Other" | fstz23_sf$income_source == "Rental income" | fstz23_sf$income_source == "Interest from savings, investments, stocks, unit trusts etc.")

These recoded variables will now be used for model calibrations instead of the original variables.

B.4.3 Creation of overall measures

There are currently three different variables for financial inclusion looking at three different dimensions of financial inclusion. We can create an overall financial inclusion variable to indicate if the respondent is included in any of the three different dimensions of FI.

We use the following code to create a new variable which returns 1 if any of the three variables is 1, otherwise it returns zero. As the three original variables are in zero and one, we can use the logical or operator (|) to implement this operation. We then use as.integer() to convert the result from logical to a zero-one integer.

fstz23_sf$fi_overall <- as.integer(fstz23_sf$fi_banked | fstz23_sf$fi_formal | fstz23_sf$fi_informal)

We can also do the same for the different categories of physical impairment and create a single variable that combines it all.

fstz23_sf$any_impaired <- as.integer(fstz23_sf$visual_impaired | fstz23_sf$hearing_impaired | fstz23_sf$comm_impaired |
                                       fstz23_sf$move_impaired | fstz23_sf$daily_impaired)

B.4.4 Converting fstz23_sf into an sf dataframe

The object fstz23_sf still does not include any geospatial information and cannot be used later for geographically weighted modelling. To solve this, we can use the district centroids as the point location of the respondents. We first use left_join() to import the point geometry of the centroids, and then we use st_as_sf() in order to make sure that the new object is recognized as an sf dataframe.

fstz23_sf <- left_join(fstz23_sf, tz_dist_centroids, by = "district") %>%
  st_as_sf()

We can check if the geometries are properly mapped by plotting the respondents onto the boundary map using tmap package.

tm_shape(tz_dist) +
  tm_polygons("grey") +
tm_shape(fstz23_sf) +
  tm_dots("blue", size = 0.2) +
  tm_layout(title = "Respondents",
            title.position = c("left", "bottom"))

The respondents appear to be properly mapped to the district centroids, however, duplicate locations will be unacceptable for the methods we will use for geographically weighted modelling later. To solver this, we can slightly shift the points by introducing st_jitter(). For the amount argument, we use a value of 1000 which means that points will be shifted by up to 1km from their original point. This 1km should not be an issue and will not cause points to go beyond the district boundary.

fstz23_sf <- st_jitter(fstz23_sf, 1000)

We can doublecheck that the operation is successful and there are no duplicated locations by using duplicated() to check if a value is duplicated and then using any() to check if the function returned true for any value.

any(duplicated(fstz23_sf$geometry))
[1] FALSE

The code hase returned FALSE so we are assured that there are no duplicated point locations.

C. Exploratory Data Analysis

Before calibrating the model, we will run use visualization and statistical techniques to go through the potential modeling variables to try to extract some more insights.

C.1 Respondent Age

The age variable in the survey data is the only numeric variable retained. We can use the following code to produce a histogram to show the distribution of values of this variable. We use ggplot package to produce the chart, but we include the central measures– the mean, median, mode, as captions for additional insights.

ggplot(fstz23_sf, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "#4A90E2", color = "black") +
  labs(title = "Age Distribution of Respondents",
       x = "Respondent Age",
       y = "Number of Respondents",
       caption = paste("Mean =", round(mean(fstz23_sf$age, na.rm = TRUE), 1), 
                       ", Median =", round(median(fstz23_sf$age, na.rm = TRUE), 1), 
                       ", Mode =", as.numeric(names(sort(table(fstz23_sf$age), decreasing = TRUE)[1])))) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

We can also display the summary statistics using the summary() function.

summary(fstz23_sf$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   27.00   37.00   39.68   50.00  100.00 

The outputs show that the ages range from 16 to 100 and is right skewed. The distribution has a mean of ~40yrs and a mode of 30 yrs. Given the shape of the distribution, we can consider scaled versions of the variable when we calibrate the model later.

xxx

C.2 Financial Inclusion Measures

We currently have four variables that give an indication of whether the respondent is financially included. For this study, we want to limit to one, or at most two variables. We expect that some of the variables are highly correlated, while some will perform much better in a model than others.

We first check the overall distribution or proportion of respondents across these four measures. We use ggplot package to create a bar chart to show the proportion of respondents achieving financial inclusion based on each dimension. The first part of the code computes for the proportion numbers as plotting the data directly will result in counts rather than percentages.

# Calculate the proportions for each variable
proportions <-  st_drop_geometry(fstz23_sf) %>%
  summarise(
    fi_banked = mean(fi_banked, na.rm = TRUE) * 100,
    fi_formal = mean(fi_formal, na.rm = TRUE) * 100,
    fi_informal = mean(fi_informal, na.rm = TRUE) * 100,
    fi_overall = mean(fi_overall, na.rm = TRUE) * 100
  )

# Convert the proportions to a long format for ggplot2
proportions_long <- proportions %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value")

# Create the bar chart
ggplot(proportions_long, aes(x = variable, y = value, fill = variable)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = round(value, 1)), vjust = -0.5, size = 4) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(title = "Proportion of Respondents for Financial Inclusion Variables",
       x = "",
       y = "Percentage",
       fill = "Variable") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 12),
    axis.text = element_text(size = 10),
    legend.position = "none"
  )

The output shows that only 20.5% of the respondents are banked, the lowest across the three main dimensions. It also shows that the difference between formal and overall financial inclusion is a difference of 7%. This means that only 7% of respondents are banked or using informal FI instruments, but are not using formal instruments.

We can visualize the correlation based on overlapping values across the three dimensions. Visually, overlaps can be visualized using Venn diagrams, but we can also use an upset chart from UpSetR package. This visualization is more scalable than venn diagrams, which is not really an issue since we only have three categories. This chart makes it much easier to compare intersections and non-intersections against each other.

In the code chunk below, we load the UpSetR package, and then prepare the data so that we only have the three variables. The preparation ensures that the variables are in 0-1 integers which is the required format for the function. We then use upset() from UpSetR package to produce the chart by passing the data and defining the variables to be plotted.

library(UpSetR)

# Create a binary dataframe
overlap_data <- as.data.frame(st_drop_geometry(fstz23_sf)) %>%
  mutate(
    Banked = as.integer(fi_banked == 1),
    Formal = as.integer(fi_formal == 1),
    Informal = as.integer(fi_informal == 1)
  ) %>%
  select(Banked, Formal, Informal)

# Create the UpSet plot
upset(overlap_data, sets = c("Banked", "Formal", "Informal"),
      keep.order = TRUE, order.by = "freq", number.angles = 45,
      main.bar.color = "blue", sets.bar.color = "red",
      text.scale = c(1.5, 1.5, 1.5, 1, 1.5, 1.5),
      mainbar.y.label = "Intersection Size", sets.x.label = "Set Size")

The resulting chart shows that:

  1. All banked respondents are also financial included based on formal instruments. (bank is a subset or category under formal instruments)
  2. There are 663 respondents (~7%) that are financially included based on informal instruments, but not based on formal instruments
  3. There are 2812 respondents (~28%) that are financially included based on formal instruments, but not based on informal instruments
  4. There are 3836 respondents (~38%) that are financially included based on both formal and informal instruments

We do not have enough to already exclude any of these variables, but we expect that the overall measure covers too much of the sample to be predictable. We also expect that the informal FI measure has too much overlap with formal so we would likely choose one but not the other.

C.3 Correlation of predictors

We can check if any of the predictors are highly correlated as the precence of autocorrelation affects the performance and interpretation of the model.

We can do this by producing the correlation plot of the potential predictor variables. We use ggcorrmat() of ggstatsplot package to produce this. We pass a dataframe with just the predictor variables and the code will output a diagonal matrix with the correlation coeficients between each pair of variables.

ggcorrmat(select(st_drop_geometry(fstz23_sf),
                 -c(region, district, ward, maritalstatus, education, land_own, income_source,
                    fi_banked, fi_formal, fi_informal, fi_overall)))

The code outputs a large matrix, but we only need to focus on pairs where the correlation coefficient is high. (i.e., > 0.8) Those are:

  • age and age_standardized = 1 - This is expected as one is just a transformation. We will keep them both for now as we want to see which one will result to a better model.

  • any_impaired and visual_impaired = 0.81 - This is also expected as any_impaired is a derived variable. It looks like most of the impairment reported is visual in nature. We will then drop the derived variable

  • income_salaried and reg_job = 0.95 - This appears to be redundant variables and likely refer to the same condition. We should be able to drop the latter

  • production and income_farm_and_fish = 0.71 - while not as high as the last pair, this pair most likely also refers to the same type of work. We will follow the same approach so we only keep the variables prefixed by income in the calibration

Based on those, we can clean up our dataset by dropping the three variables mentioned above. We can also drop the variables we don’t need which include the categorical variables that we already have created new variables for. We perform this with the use of the select() function and using a “-” to exclude rather than select columns.

fstz23_sf <- select(fstz23_sf,-c(any_impaired, reg_job, production,
                                 maritalstatus, education, land_own, income_source))

D. Global (NonSpatial) FI Model

Before we calibrate a geographically weighted model, we first calibrate a global model or models that do not take the geographic locations into account.

D.1 Global Models without Variable Selection

To calibrate the model, we use glm() which calibrates generalised linear models. As the dependent variable is binary, we need to make sure that the model used is a logistic regression model. This is done by passing the value “binomial” to the family argument.

We will run this with all variables for all four FI measures to see if any of them are performing very well or very poorly against the others. We will then focus on finetuning and then preparing the geographically weighted models on those variables that can be best explained with this approach.

D.1.1 Global Model for Formal FI (no variable selection)

The code below calibrates a model with all variables with fi_formal as the dependent variable. We then use summary() to output the results. While the results display the AIC as a measure, we also compute for the reduction in variance due to the predictors. This is done by comparing the deviance and the null deviance. A higher number means that the variables had a larger contribution in predicting or explaining the value of the dependent variable.

fi_formal.lr <- glm(fi_formal ~ urban + age + female + head_hh + visual_impaired + hearing_impaired + comm_impaired +
                      move_impaired + daily_impaired + cogn_impaired + agricultural + mobile + internet +
                      own_mobile + has_id + no_income + is_married + was_married+
                      educ_primary + educ_secondary + educ_tertiary +
                      land_self_own + land_hh_or_shared + land_rented +
                      income_farm_and_fish + income_piecework + income_dependent + income_trader + income_salaried +
                      income_other,
                    family = "binomial",
                    data = fstz23_sf)
summary(fi_formal.lr)

Call:
glm(formula = fi_formal ~ urban + age + female + head_hh + visual_impaired + 
    hearing_impaired + comm_impaired + move_impaired + daily_impaired + 
    cogn_impaired + agricultural + mobile + internet + own_mobile + 
    has_id + no_income + is_married + was_married + educ_primary + 
    educ_secondary + educ_tertiary + land_self_own + land_hh_or_shared + 
    land_rented + income_farm_and_fish + income_piecework + income_dependent + 
    income_trader + income_salaried + income_other, family = "binomial", 
    data = fstz23_sf)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -0.909394   0.313346  -2.902 0.003705 ** 
urban                 0.632501   0.080982   7.810 5.70e-15 ***
age                   0.005744   0.002538   2.263 0.023640 *  
female                0.046569   0.074759   0.623 0.533335    
head_hh               0.178742   0.087275   2.048 0.040557 *  
visual_impaired      -0.024436   0.095515  -0.256 0.798080    
hearing_impaired     -0.291409   0.158199  -1.842 0.065469 .  
comm_impaired         0.275147   0.337988   0.814 0.415603    
move_impaired         0.072813   0.122261   0.596 0.551477    
daily_impaired        0.111153   0.201332   0.552 0.580887    
cogn_impaired         0.144262   0.159558   0.904 0.365924    
agricultural          0.030786   0.084129   0.366 0.714411    
mobile                0.549352   0.082138   6.688 2.26e-11 ***
internet              0.579452   0.081901   7.075 1.49e-12 ***
own_mobile            1.902444   0.070026  27.168  < 2e-16 ***
has_id               -0.675854   0.089887  -7.519 5.52e-14 ***
no_income            -0.583239   0.122630  -4.756 1.97e-06 ***
is_married           -0.034435   0.098156  -0.351 0.725726    
was_married          -0.194584   0.134307  -1.449 0.147392    
educ_primary          0.663236   0.064346  10.307  < 2e-16 ***
educ_secondary        1.492458   0.128363  11.627  < 2e-16 ***
educ_tertiary         2.714657   0.723016   3.755 0.000174 ***
land_self_own         0.170538   0.095351   1.789 0.073690 .  
land_hh_or_shared     0.081153   0.085445   0.950 0.342232    
land_rented           0.375301   0.137679   2.726 0.006412 ** 
income_farm_and_fish -0.692323   0.268336  -2.580 0.009878 ** 
income_piecework     -0.791856   0.269060  -2.943 0.003250 ** 
income_dependent     -0.874045   0.275270  -3.175 0.001497 ** 
income_trader        -0.026406   0.300939  -0.088 0.930079    
income_salaried      -0.157785   0.344056  -0.459 0.646518    
income_other         -0.660914   0.403607  -1.638 0.101522    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11074.6  on 9907  degrees of freedom
Residual deviance:  7563.1  on 9877  degrees of freedom
AIC: 7625.1

Number of Fisher Scoring iterations: 7
1 - summary(fi_formal.lr)$deviance / summary(fi_formal.lr)$null.deviance 
[1] 0.3170735

D.1.2 Global Model for Banked FI (no variable selection)

The code below calibrates a model with all variables with fi_banked as the dependent variable.

fi_banked.lr <- glm(fi_banked ~ urban + age + female + head_hh + visual_impaired + hearing_impaired + comm_impaired +
                      move_impaired + daily_impaired + cogn_impaired + agricultural + mobile + internet +
                      own_mobile + has_id + no_income + is_married + was_married+
                      educ_primary + educ_secondary + educ_tertiary +
                      land_self_own + land_hh_or_shared + land_rented +
                      income_farm_and_fish + income_piecework + income_dependent + income_trader + income_salaried +
                      income_other,
                    family = "binomial",
                    data = fstz23_sf)
summary(fi_banked.lr)

Call:
glm(formula = fi_banked ~ urban + age + female + head_hh + visual_impaired + 
    hearing_impaired + comm_impaired + move_impaired + daily_impaired + 
    cogn_impaired + agricultural + mobile + internet + own_mobile + 
    has_id + no_income + is_married + was_married + educ_primary + 
    educ_secondary + educ_tertiary + land_self_own + land_hh_or_shared + 
    land_rented + income_farm_and_fish + income_piecework + income_dependent + 
    income_trader + income_salaried + income_other, family = "binomial", 
    data = fstz23_sf)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.037948   0.289783  -7.033 2.03e-12 ***
urban                 0.487211   0.070234   6.937 4.01e-12 ***
age                   0.008988   0.002742   3.278 0.001046 ** 
female               -0.244823   0.073536  -3.329 0.000871 ***
head_hh               0.429123   0.084337   5.088 3.62e-07 ***
visual_impaired       0.220218   0.092799   2.373 0.017641 *  
hearing_impaired     -0.545221   0.215985  -2.524 0.011592 *  
comm_impaired         0.168385   0.482018   0.349 0.726839    
move_impaired        -0.322852   0.143402  -2.251 0.024361 *  
daily_impaired        0.103399   0.261231   0.396 0.692242    
cogn_impaired        -0.163345   0.189727  -0.861 0.389267    
agricultural         -0.177993   0.076526  -2.326 0.020024 *  
mobile               -0.054909   0.136147  -0.403 0.686723    
internet              0.628532   0.063778   9.855  < 2e-16 ***
own_mobile            0.856049   0.122809   6.971 3.16e-12 ***
has_id               -1.180586   0.146446  -8.062 7.53e-16 ***
no_income            -0.091093   0.158109  -0.576 0.564522    
is_married           -0.082966   0.090487  -0.917 0.359204    
was_married          -0.397742   0.126719  -3.139 0.001697 ** 
educ_primary          0.796899   0.086567   9.206  < 2e-16 ***
educ_secondary        1.700194   0.107407  15.829  < 2e-16 ***
educ_tertiary         3.121148   0.208886  14.942  < 2e-16 ***
land_self_own         0.272392   0.094898   2.870 0.004100 ** 
land_hh_or_shared     0.162165   0.093024   1.743 0.081288 .  
land_rented           0.190437   0.108746   1.751 0.079909 .  
income_farm_and_fish -1.599462   0.207346  -7.714 1.22e-14 ***
income_piecework     -1.950615   0.209724  -9.301  < 2e-16 ***
income_dependent     -1.893223   0.225727  -8.387  < 2e-16 ***
income_trader        -1.546829   0.219310  -7.053 1.75e-12 ***
income_salaried      -0.644514   0.226550  -2.845 0.004442 ** 
income_other         -1.345573   0.301468  -4.463 8.07e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10056.8  on 9907  degrees of freedom
Residual deviance:  7550.3  on 9877  degrees of freedom
AIC: 7612.3

Number of Fisher Scoring iterations: 6
1 - summary(fi_banked.lr)$deviance / summary(fi_banked.lr)$null.deviance 
[1] 0.2492412

D.1.3 Global Model for Informal FI (no variable selection)

The code below calibrates a model with all variables with fi_informal as the dependent variable.

fi_informal.lr <- glm(fi_informal ~ urban + age + female + head_hh + visual_impaired + hearing_impaired + comm_impaired +
                      move_impaired + daily_impaired + cogn_impaired + agricultural + mobile + internet +
                      own_mobile + has_id + no_income + is_married + was_married+
                      educ_primary + educ_secondary + educ_tertiary +
                      land_self_own + land_hh_or_shared + land_rented +
                      income_farm_and_fish + income_piecework + income_dependent + income_trader + income_salaried +
                      income_other,
                    family = "binomial",
                    data = fstz23_sf)
summary(fi_informal.lr)

Call:
glm(formula = fi_informal ~ urban + age + female + head_hh + 
    visual_impaired + hearing_impaired + comm_impaired + move_impaired + 
    daily_impaired + cogn_impaired + agricultural + mobile + 
    internet + own_mobile + has_id + no_income + is_married + 
    was_married + educ_primary + educ_secondary + educ_tertiary + 
    land_self_own + land_hh_or_shared + land_rented + income_farm_and_fish + 
    income_piecework + income_dependent + income_trader + income_salaried + 
    income_other, family = "binomial", data = fstz23_sf)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -0.85492    0.22360  -3.823 0.000132 ***
urban                 0.06146    0.05511   1.115 0.264748    
age                  -0.01826    0.00197  -9.270  < 2e-16 ***
female                0.09939    0.05470   1.817 0.069206 .  
head_hh               0.50994    0.06308   8.084 6.29e-16 ***
visual_impaired       0.17875    0.07094   2.520 0.011748 *  
hearing_impaired      0.03574    0.13339   0.268 0.788769    
comm_impaired        -0.33005    0.30810  -1.071 0.284058    
move_impaired         0.05833    0.09647   0.605 0.545371    
daily_impaired       -0.19622    0.17527  -1.120 0.262915    
cogn_impaired         0.09270    0.12797   0.724 0.468830    
agricultural          0.10476    0.05926   1.768 0.077070 .  
mobile                0.43162    0.07583   5.691 1.26e-08 ***
internet              0.40287    0.05307   7.591 3.19e-14 ***
own_mobile            0.31775    0.06393   4.970 6.68e-07 ***
has_id               -0.46332    0.07398  -6.263 3.77e-10 ***
no_income            -0.52889    0.10286  -5.142 2.72e-07 ***
is_married            0.52809    0.06802   7.764 8.24e-15 ***
was_married           0.21085    0.09467   2.227 0.025933 *  
educ_primary          0.20200    0.05155   3.919 8.90e-05 ***
educ_secondary        0.39907    0.07977   5.003 5.64e-07 ***
educ_tertiary         0.63482    0.16032   3.960 7.50e-05 ***
land_self_own        -0.18105    0.06966  -2.599 0.009349 ** 
land_hh_or_shared     0.13379    0.06473   2.067 0.038751 *  
land_rented           0.02514    0.08747   0.287 0.773773    
income_farm_and_fish  0.20099    0.17616   1.141 0.253888    
income_piecework      0.13475    0.17689   0.762 0.446201    
income_dependent     -0.20958    0.18466  -1.135 0.256397    
income_trader         0.45674    0.18825   2.426 0.015254 *  
income_salaried       0.42832    0.19652   2.180 0.029291 *  
income_other          0.32673    0.26233   1.246 0.212942    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13683  on 9907  degrees of freedom
Residual deviance: 12417  on 9877  degrees of freedom
AIC: 12479

Number of Fisher Scoring iterations: 4
1 - summary(fi_informal.lr)$deviance / summary(fi_informal.lr)$null.deviance 
[1] 0.09256485

D.1.4 Global Model for Overall FI (no variable selection)

The code below calibrates a model with all variables with fi_overall as the dependent variable.

fi_overall.lr <- glm(fi_overall ~ urban + age + female + head_hh + visual_impaired + hearing_impaired + comm_impaired +
                      move_impaired + daily_impaired + cogn_impaired + agricultural + mobile + internet +
                      own_mobile + has_id + no_income + is_married + was_married+
                      educ_primary + educ_secondary + educ_tertiary +
                      land_self_own + land_hh_or_shared + land_rented +
                      income_farm_and_fish + income_piecework + income_dependent + income_trader + income_salaried +
                      income_other,
                    family = "binomial",
                    data = fstz23_sf)
summary(fi_overall.lr)

Call:
glm(formula = fi_overall ~ urban + age + female + head_hh + visual_impaired + 
    hearing_impaired + comm_impaired + move_impaired + daily_impaired + 
    cogn_impaired + agricultural + mobile + internet + own_mobile + 
    has_id + no_income + is_married + was_married + educ_primary + 
    educ_secondary + educ_tertiary + land_self_own + land_hh_or_shared + 
    land_rented + income_farm_and_fish + income_piecework + income_dependent + 
    income_trader + income_salaried + income_other, family = "binomial", 
    data = fstz23_sf)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -0.055692   0.333888  -0.167 0.867527    
urban                 0.516681   0.086899   5.946 2.75e-09 ***
age                  -0.001405   0.002650  -0.530 0.595996    
female                0.024066   0.078557   0.306 0.759332    
head_hh               0.235402   0.093558   2.516 0.011866 *  
visual_impaired       0.150577   0.103060   1.461 0.144000    
hearing_impaired     -0.114633   0.163713  -0.700 0.483798    
comm_impaired         0.033323   0.328379   0.101 0.919171    
move_impaired         0.007001   0.126653   0.055 0.955920    
daily_impaired       -0.002848   0.200755  -0.014 0.988681    
cogn_impaired         0.204732   0.165965   1.234 0.217357    
agricultural          0.043097   0.087983   0.490 0.624249    
mobile                0.517871   0.081228   6.376 1.82e-10 ***
internet              0.554273   0.091495   6.058 1.38e-09 ***
own_mobile            1.524108   0.074932  20.340  < 2e-16 ***
has_id               -0.592184   0.090867  -6.517 7.17e-11 ***
no_income            -0.620211   0.119034  -5.210 1.88e-07 ***
is_married            0.254598   0.101106   2.518 0.011798 *  
was_married           0.096682   0.140398   0.689 0.491059    
educ_primary          0.563723   0.068504   8.229  < 2e-16 ***
educ_secondary        1.393584   0.141706   9.834  < 2e-16 ***
educ_tertiary         2.338111   0.722148   3.238 0.001205 ** 
land_self_own         0.108214   0.101402   1.067 0.285892    
land_hh_or_shared     0.210526   0.088977   2.366 0.017979 *  
land_rented           0.319084   0.146909   2.172 0.029857 *  
income_farm_and_fish -0.616538   0.290245  -2.124 0.033654 *  
income_piecework     -0.684558   0.290886  -2.353 0.018605 *  
income_dependent     -1.042797   0.295604  -3.528 0.000419 ***
income_trader        -0.036161   0.329672  -0.110 0.912657    
income_salaried      -0.187959   0.374037  -0.503 0.615305    
income_other         -0.262658   0.473406  -0.555 0.579014    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9339.8  on 9907  degrees of freedom
Residual deviance: 6817.2  on 9877  degrees of freedom
AIC: 6879.2

Number of Fisher Scoring iterations: 7
1 - summary(fi_overall.lr)$deviance / summary(fi_overall.lr)$null.deviance
[1] 0.2700872

D.1.5 Choosing a dependent variable based on global model calibration

The different models showed that the one using fi_formal and fi_overall as the dependent variable performed better than the other two. Since fi_formal gave the best reduction in deviance, and we have seen earlier that the other fi measures overlap with fi_formal anyway, we will focus on fi_formal as the main indicator for which we will finetune the model.

D.2 Variable selection for the global model

From the model results, we see that not all the variables contribute in the same degree as the others. We can use the output to pick the significant variables or perform a technique like forward, backward or stepwise regression to select variables by introducing or removing them one at a time.

Forward regression can be done using ols_step_forward_p() of the olsrr package. The function takes in the full model and starts from an empty model and adds variables with the highest significance one at a time. This continues doing this as long as variables with significance less than the specified p-value can be added.

fi_formal_fw_mlr <- ols_step_forward_p(fi_formal.lr, p_val = 0.05, details = FALSE)

We can also display the results by calling the resulting object.

fi_formal_fw_mlr

                                      Stepwise Summary                                      
------------------------------------------------------------------------------------------
Step    Variable             AIC          SBC             SBIC            R2       Adj. R2 
------------------------------------------------------------------------------------------
 0      Base Model        11452.392    11466.795     -5643875908.953    0.00000    0.00000 
 1      own_mobile         8001.162     8022.765    -11332019050.848    0.29427    0.29420 
 2      has_id             7826.777     7855.582    -11742615711.514    0.30673    0.30659 
 3      urban              7661.916     7697.921    -12144725901.394    0.31830    0.31810 
 4      mobile             7569.414     7612.621    -12378471273.642    0.32478    0.32450 
 5      internet           7492.544     7542.952    -12576969271.789    0.33013    0.32979 
 6      no_income          7432.878     7490.487    -12734351633.418    0.33429    0.33388 
 7      educ_secondary     7382.226     7447.036    -12870263096.019    0.33781    0.33735 
 8      educ_primary       7285.271     7357.282    -13129780806.723    0.34439    0.34386 
 9      educ_tertiary      7255.572     7334.784    -13213906836.776    0.34649    0.34589 
 10     age                7243.236     7329.649    -13252042393.875    0.34743    0.34677 
 11     income_trader      7234.121     7327.736    -13281646286.889    0.34817    0.34744 
 12     head_hh            7230.229     7331.044    -13297292381.943    0.34855    0.34776 
------------------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.590       RMSE                  0.348 
R-Squared               0.349       MSE                   0.121 
Adj. R-Squared          0.348       Coef. Var            46.241 
Pred R-Squared          0.347       AIC                7230.229 
MAE                     0.249       SBC                7331.044 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                 ANOVA                                  
-----------------------------------------------------------------------
                Sum of                                                 
               Squares          DF    Mean Square       F         Sig. 
-----------------------------------------------------------------------
Regression     642.088          12         53.507    441.188    0.0000 
Residual      1200.065        9895          0.121                      
Total         1842.153        9907                                     
-----------------------------------------------------------------------

                                    Parameter Estimates                                     
-------------------------------------------------------------------------------------------
         model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-------------------------------------------------------------------------------------------
   (Intercept)     0.236         0.017                 14.240    0.000     0.204     0.269 
    own_mobile     0.388         0.011        0.388    36.683    0.000     0.367     0.408 
        has_id    -0.114         0.011       -0.094    -9.947    0.000    -0.136    -0.091 
         urban     0.061         0.008        0.067     7.537    0.000     0.045     0.077 
        mobile     0.108         0.012        0.088     8.912    0.000     0.084     0.132 
      internet     0.053         0.009        0.055     6.158    0.000     0.036     0.069 
     no_income    -0.098         0.013       -0.066    -7.601    0.000    -0.123    -0.072 
educ_secondary     0.162         0.013        0.133    12.745    0.000     0.137     0.187 
  educ_primary     0.099         0.009        0.115    11.673    0.000     0.083     0.116 
 educ_tertiary     0.134         0.023        0.053     5.924    0.000     0.090     0.179 
           age     0.001         0.000        0.028     2.850    0.004     0.000     0.001 
 income_trader     0.042         0.013        0.028     3.321    0.001     0.017     0.067 
       head_hh     0.019         0.008        0.022     2.426    0.015     0.004     0.035 
-------------------------------------------------------------------------------------------

The results show that the global model for fi_formal includes 12 explanatory variables which consist of variables for:

  • Mobile phone usage and ownership and internet access (3 variables) - has the highest combined weight

  • Education (3 variables)

  • Other positive coefficient: urban, age, trader, head_hh

  • Negative coefficients: no source of income, no id

The last variable is surprising as it implies that having an id is linked to a lower probability of being financially included. The model did not pick up gender which means that it doesn’t see a clear distinction between males and females for this dimension of financial inclusion.

D.3 Testing for spatial autocorrelation

Before calibrating the geographically weighted model, we can check if there is a pattern linking global model performance and the respondent’s location. One way to do this is to plot the residuals or error and check for any patterns.

The first step is to export relevant output from the model as a dataframe. We just extract the residuals by referencing the model object in the results.

mlr_output <- as.data.frame(fi_formal_fw_mlr$model$residuals) %>%
  rename('FW_MLR_RES' = 'fi_formal_fw_mlr$model$residuals')

We then import these into fstz23_sf as a new variable MLR_RES using cbind(). This function adds the dataframe as new columns in their current order.

fstz23_sf <- cbind(fstz23_sf,
                         mlr_output$FW_MLR_RES) %>%
  rename('MLR_RES' = 'mlr_output.FW_MLR_RES')

We remember that our respondent data only have the district s their location so dissplaying them as points might not be very meaningful or accurate. We can instead plot the residuals at a district level to see if there are any patterns arising from there.

To do this, we first compute for the average residual at a district level by using group_by() to summarise the object by district and then define the aggregate function using summarise(). We use st_drop_geometry() so that the geometry column is dropped and we only have the district name and the average residual value.

avg_res_df <- st_drop_geometry(fstz23_sf) %>%
  group_by(district) %>%
  summarise(avg_res = mean(MLR_RES, na.rm = TRUE))

head(avg_res_df)
# A tibble: 6 × 2
  district                 avg_res
  <chr>                      <dbl>
1 Arusha                    0.105 
2 Arusha Urban              0.0168
3 Babati                   -0.0522
4 Babati UrbanBabati Urban -0.0185
5 Bagamoyo                  0.0395
6 Bahi                     -0.179 

We then export these average residual values into the TZ boundary map by using left_join() on the district name.

tz_dist_stat <- tz_dist_stat %>%
  left_join(avg_res_df, by= "district")

We can now use tmap package to visually display the average residual value per district. We again use two layers, and then exclude any districts where there is no residuals by using a mask with !is.na()

tm_shape(tz_dist)+
  tm_polygons("grey") +
tm_shape(tz_dist_stat[!is.na(tz_dist_stat$avg_res),]) +
  tm_polygons("avg_res", title = "Average Residuals")
Variable(s) "avg_res" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

The map shows some apparent clusters with positive (average) residuals and some with negative residuals. A large number of districts have relatively low average residuals between -0.1 and 0.1.

We can confirm this observation by running global Moran’s I test on the average residual value. In order to do this, we first need to compute for the neighbors and the weights for each of the district. We use st_knn() to derive neighbors using knn method with a parameter of 6 neighbors, and then use equal weights for neighbors using the st_weights() function. We perform these functions on the centroids object as these methods work on points rather than shapes.

tz_dist_centroids <- tz_dist_centroids %>%
  mutate(nb = st_knn(geometry, k = 6,
                     longlat = FALSE),
         wt = st_weights(nb,
                         style = "W"),
         .before = 1)

We then run the Moran’s I test with permutations using global_moran_perm() from sfdep package. Note that we use the avg_res from the map object but have kept the neighbor list and weights in the centrodis object. The nsim argument indicates that we are running 10 simulations for this test. We also replace any na values with zero as the code will not work with any na values.

set.seed(1234)
global_moran_perm(replace_na(tz_dist_stat$avg_res,0),
                  tz_dist_centroids$nb,
                  tz_dist_centroids$wt,
                  alternative = "two.sided",
                  nsim = 99)

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.19584, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided

As the results are significant, the test confirms that there is spatial autocorrelation for the average residual values across districts. The positive test statistic confirms our observation that the pattern is that of clustering. We should then build geographically weighted models as they are likely to produce better results as we account for the respondents’ locations.

E. Geographically Weighted FI Model

E.1 Computing a bandwidth

The first step in calibrating a geographically weighted model is determining the bandwidth to use. The choice can either be a fixed bandwidth which is based on distance, or an adaptive bandwidth which is based on the number of neighbors. For our case, we will opt for a fixed bandwidth since we have not precisely mapped the locations of the respondents, and there is a wide range of values for the number of respondents per district. A fixed bandwidth is likely to ensure that it captures most, if not all, of the points in the same district and also some in neighboring districts.

To compute for the optimum fixed bandwidth, we use bw.ggwr() of GWModel package. The approach argument defines the stopping rule to be used, which is cross validation in this case. Setting the adaptive argument to FALSE indicates that we are computing for the fixed bandwidth. Like the global model, we indicate binomial for the family argument to specify we are calibrating a logistic regression model.

The first part of the function is the formula for the model. To be sure of the details to put in here, one may use the formula() function on the model object of the forward regression output to display the final formula.

bw_fixed <- bw.ggwr(formula = fi_formal ~ own_mobile + has_id + urban + mobile + internet +
                     no_income + educ_secondary + educ_primary + educ_tertiary +
                     age + income_trader + head_hh,
                  data = fstz23_sf,
                  family = "binomial",
                  approach = "CV",
                  kernel = "gaussian",
                  adaptive = FALSE,
                  longlat = FALSE)

# To display the global model's formula, you may use
# formula(fi_formal_fw_mlr$model)

The output shows a recommended bandwidth of ~99.5km should be used. We save the output as an rds object to save our results and prevent the need to rerun the code again.

write_rds(bw_fixed, "data/rds/bw_fixed.rds")

E.2 Deriving the distance matrix

Calibration of the logistics regression model requires a properly set up distance matrix. We use the code below which computes for this using gw.dist() on the coordinates of the data points. The coordinates function does not work on an sf dataframe so we convert it into Spatial format first.

distMAT <- gw.dist(dp.locat=
                     coordinates(as_Spatial(fstz23_sf)))

E.3 Calibrating the fixed bandwidth model

We can now calibrate the geographically weighted model using the computed bandwidthby using ggwr.basic() from GWModel. The function uses mostly the same arguments as the previous one, with the exception that the bandwidth now becomes an input here.

gwr_fixed <- ggwr.basic(formula = fi_formal ~ own_mobile + has_id + urban + mobile + internet +
                     no_income + educ_secondary + educ_primary + educ_tertiary +
                     age + income_trader + head_hh,
                     data = fstz23_sf,
                     family = "binomial",
                     kernel = "gaussian",
                     bw = bw_fixed,
                     adaptive = FALSE,
                     longlat = FALSE,
                     dMat = distMAT)

We again save this object into an rds file to save our work for future runs.

write_rds(gwr_fixed, "data/rds/gwr_fixed.rds")